Milestone 2
26 Nov 2021
Question 2
1
2.1 Shot Counts Histogram, binned by distance:
During the 2015-2018 seasons, combined shot and goal counts are highest for shots taken between 5 and 60 feet, and the number of shots taken from further than 60 feet away decreases significantly. The most shots were taken between 10-15 feet (30124), and the most goals were also scored between 10-15 feet (5997). All other distance bins contain fewer than 2000 shots and fewer than 400 goals each. The 0-5 feet range has a very low shot count, possibly because a shooter that close to the net is also right on top of the goalie, making it hard to angle a shot into the net.
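The binning described above can be sketched as follows. The distances here are synthetic stand-ins for the real shot data, and the 5-foot bin width matches the histogram in the figure:

```python
import numpy as np

# Synthetic shot distances in feet; in the real analysis these come from
# the tidied play-by-play data. The distribution here is illustrative.
rng = np.random.default_rng(0)
distances = np.clip(rng.gamma(shape=3.0, scale=8.0, size=10_000), 0, 199)

# 5-foot bins from 0 to 200 feet, matching the histogram described above.
bins = np.arange(0, 205, 5)
counts, edges = np.histogram(distances, bins=bins)

# The modal bin is the one with the highest shot count.
peak = np.argmax(counts)
print(f"Most shots fall in the {edges[peak]:.0f}-{edges[peak+1]:.0f} ft bin")
```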
2.1 Shot Counts Histogram, binned by angle:
During the 2015-2018 seasons, the combined shot and goal counts by shot angle form a roughly bimodal distribution. The most shots were taken between 60-65 degrees and 115-120 degrees (15506 and 15958, respectively). No single angle bin clearly has the most goals; goal counts mostly range between 1300-1800 across angles of 60-120 degrees. Shot counts drop off rapidly below 45 degrees and above 140 degrees, and goal counts follow the same pattern.
2.1 2D Histogram:
This 2D histogram has one square bin per (distance bin, angle bin) pair; the darker the shade of blue, the higher the count. The darkest patches occur where both the distance histogram and the angle histogram peak, for example at distances around 10-15 feet combined with angles of 60-65 or 115-120 degrees. Bins pairing far distances with extreme angles are white because there are no instances of shots taken from those positions.
2
2.2 Goal Rate to Distance:
Goal rate is highest between 0-5 feet (~0.31) and drops drastically out to 70 feet (~0.03), which is also roughly where shot counts fall off. This makes sense: the closer the shot, the less time the goalie has to react, and the higher the chance of scoring. Although goal percentages for shots taken beyond 70 feet fluctuate and are sometimes higher than at closer ranges, the sample sizes there are too small for those percentages to be reliable, so they should be disregarded.
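The goal-rate-by-distance computation amounts to goals divided by shots per distance bin. A minimal sketch, with synthetic events and illustrative column names:

```python
import numpy as np
import pandas as pd

# Synthetic shot events; in the real analysis each row is a shot from the
# tidied play-by-play data. The scoring model here is a toy assumption
# where closer shots score more often.
rng = np.random.default_rng(1)
df = pd.DataFrame({"shot_distance": rng.uniform(0, 100, 5000)})
df["is_goal"] = rng.random(5000) < np.clip(0.3 - df["shot_distance"] / 300,
                                           0.02, None)

# Goal rate per 5-foot bin: mean of the binary goal indicator per bin.
df["dist_bin"] = pd.cut(df["shot_distance"], bins=np.arange(0, 105, 5))
goal_rate = df.groupby("dist_bin", observed=True)["is_goal"].mean()
print(goal_rate.head())
```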
2.2 Goal Rate to Angle:
Goal rate is highest around 95-100 degrees (~0.13). The shot/goal counts by angle show that shots taken between 40-140 degrees are highest in count, meaning that range has the largest sample sizes, so it is the range to focus on; goal rate there ranges from ~0.07 to ~0.13. Within it, the 80-100 degree bins have the highest goal rates (~0.12 to ~0.13). This makes sense: a shot taken more directly in front of the net leaves the goalie more area to cover, making the shot harder to save and a goal more likely.
3
2.3 Goals only, binned by distance, separated by empty net and non-empty net events:
The most goals were scored between 10-15 feet, 5997 in total (98 empty net, 5899 non-empty net). Goal counts drop rapidly as distance increases for both empty and non-empty nets, which makes sense since the chance of scoring should rise the closer the shooter is to the net. However, there is a strange small peak in goal counts around 170-175 feet, which is almost the distance of a shot taken from in front of the opposite net. We investigated this and found that the x and y coordinates for these events were actually logged incorrectly, that is, logged on the wrong side of the rink. One example is game_id 2015020671, eventIdx 404: on January 17, 2016, in the Canucks vs Islanders game during the penalty shootout, Radim Vrbata (Canucks) scored on Jaroslav Halak (Islanders) right in front of the net, yet the coordinates were logged on the opposite side. Clearly, this was a logging mistake. See https://www.youtube.com/watch?v=xQjKUsl1a9I at 5:11 for reference.
Question 3
1
We got a validation accuracy of 90.4%, which is quite high. Inspecting the model's predictions on the validation set, we realized it predicts every shot to not be a goal. A likely explanation is class imbalance: of the 311106 labels in the training data, 29187 are goals and 281919 are non-goal shots (0.094 and 0.906 of the total, respectively). A model biased toward the majority class can therefore achieve ~90% accuracy just by predicting that no shot is a goal. Other possible explanations are that the features are weakly correlated with the labels, that the model was unable to learn anything from the features, or both.
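The imbalance effect can be reproduced directly with a majority-class baseline, using the goal/shot counts from above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels built from the counts reported above: 29187 goals, 281919 non-goals.
y = np.array([1] * 29187 + [0] * 281919)
X = np.zeros((len(y), 1))   # features are irrelevant to this baseline

# A classifier that always predicts the most frequent class already scores
# ~90.6% accuracy, matching the suspiciously high validation accuracy.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = baseline.score(X, y)
print(f"majority-class accuracy: {acc:.3f}")   # ~0.906
```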
2
(no question)
3
The ROC (Receiver Operating Characteristic) curve shows the discriminative ability of a binary classifier: the true positive rate (TPR) is plotted against the false positive rate (FPR), so classifiers whose ROC curves sit closer to the top-left corner perform better, and correspondingly a higher AUC (area under the curve) indicates better performance. The Logistic Regression models trained on [distance from net] and on [distance and angle from net] tied for the best ROC curve with an AUC of 0.67, as can be seen from the red (covered) and pink curves being "pulled" toward the (0, 1.0) corner. The model trained on [angle from net] achieved an AUC of 0.50, the same as the random baseline, so it is clearly inferior to the other two models.
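A minimal sketch of the ROC/AUC comparison, with synthetic labels and scores standing in for the models' predicted goal probabilities; the "uninformative" scores behave like the angle-only model:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.random(2000) < 0.1                 # ~10% positives

# Scores that carry some signal vs. scores that are pure noise.
informative = np.clip(y_true * 0.3 + rng.random(2000) * 0.7, 0, 1)
uninformative = rng.random(2000)

fpr, tpr, _ = roc_curve(y_true, informative)    # points of the ROC curve
auc_inf = roc_auc_score(y_true, informative)
auc_rand = roc_auc_score(y_true, uninformative)
print(f"informative AUC:   {auc_inf:.2f}")
print(f"uninformative AUC: {auc_rand:.2f}")     # close to the 0.5 baseline
```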
Ideally, the higher the shot probability model percentile, the higher the goal rate should be. This holds for the models trained on [distance from net] and [distance and angle from net]: their higher predicted goal probabilities are positively correlated with the actual goal rates. The model trained on [angle from net] shows no correlation between shot probability model percentile and goal rate, indicating that it was not able to learn to predict well from the [angle from net] feature alone.
It makes sense to see the cumulative proportion of goals increase with shot probability model percentile, and all curves eventually reach 1.0, as shown. However, the curves for [distance from net] and [distance and angle from net] increase at a non-uniform pace: as the percentile increases, the rate at which the cumulative proportion of goals grows also increases. This makes sense, since higher predicted shot probabilities should correspond to more goals, so the cumulative proportion accumulates faster there.
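The percentile diagnostics above can be sketched like this: bucket shots by predicted probability percentile, then compute the goal rate and cumulative proportion of goals per bucket. Probabilities and outcomes here are synthetic:

```python
import numpy as np
import pandas as pd

# Toy predictions where higher predicted probability really does mean
# more goals, mimicking a model that learned something.
rng = np.random.default_rng(3)
prob = rng.random(10_000)
is_goal = rng.random(10_000) < prob * 0.2

df = pd.DataFrame({"prob": prob, "is_goal": is_goal})
# 20 equal-size percentile buckets (0 = lowest predictions, 19 = highest).
df["percentile"] = pd.qcut(df["prob"].rank(method="first"), 20, labels=False)

by_pct = df.groupby("percentile")["is_goal"]
goal_rate = by_pct.mean()                              # goal rate per bucket
cum_goals = by_pct.sum().cumsum() / df["is_goal"].sum()  # ends at 1.0
print(goal_rate.tail(3))
print(cum_goals.tail(3))
```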
A perfectly calibrated model's probabilities fall on the y=x curve, so curves closer to y=x indicate better calibration. The Logistic Regression models trained on [distance from net] and [distance and angle from net] achieved very similar calibration curves near y=x; their predicted probabilities are better calibrated than those of the model trained on [angle from net], which produced only a single point on the plot.
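A sketch of how such a reliability diagram is computed with scikit-learn; the data is synthetic and constructed to be perfectly calibrated, so the points land near y=x:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Outcomes drawn exactly at the stated probability -> well calibrated.
rng = np.random.default_rng(4)
prob = rng.random(20_000)
y = rng.random(20_000) < prob

# For each of 10 probability bins: observed goal fraction vs mean prediction.
frac_pos, mean_pred = calibration_curve(y, prob, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```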
4
Logistic Regression, trained on distance only: comet link
Logistic Regression, trained on angle only: comet link
Logistic Regression, trained on both distance and angle: comet link
Question 4
5
NOTE: the CSV is saved in /assets/milestone2
The list of all of the features:
- eventIdx: unique event identifier per game
- game_id: unique game identifier
- Game Seconds: number of seconds that passed in the game
- X-Coordinate: the x-coordinate of where the event occurred
- Y-Coordinate: the y-coordinate of where the event occurred
- Shot Distance: the distance from the net to where the shot was taken
- Shot Angle: the angle from the net at which the shot was taken
- Shot Type: the type of shot taken
- Was Net Empty: whether or not the net was empty when a goal was scored
- Last Event Type: what the last event was
- Last X-Coordinate: the x-coordinate of where the last event occurred
- Last Y-Coordinate: the y-coordinate of where the last event occurred
- Time from Last Event (seconds): the number of seconds that passed since the last event
- Distance from Last Event: the distance from the location of the last event
- Is Rebound: whether or not the current event is a rebound (i.e., whether the last event was also a shot)
- Change in Shot Angle: the difference in shot angle from the last event (only if this shot is a rebound)
- Speed: the distance from the last event divided by the time since the last event
- Is Goal: whether or not this event resulted in a goal
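Two of the derived features above can be sketched as follows. The column names mirror the list, but the events themselves are made up and the real extraction works over the full play-by-play data:

```python
import numpy as np
import pandas as pd

# Three illustrative events with the "last event" columns already filled in.
df = pd.DataFrame({
    "Last Event Type":                ["Shot", "Faceoff", "Shot"],
    "Distance from Last Event":       [12.0, 60.0, 8.0],
    "Time from Last Event (seconds)": [2.0, 10.0, 0.0],
})

# Is Rebound: the previous event was also a shot.
df["Is Rebound"] = df["Last Event Type"].eq("Shot")

# Speed: distance from the last event over the time since it; the
# zero-second case is guarded to avoid division by zero.
df["Speed"] = np.where(
    df["Time from Last Event (seconds)"] > 0,
    df["Distance from Last Event"] / df["Time from Last Event (seconds)"],
    0.0,
)
print(df[["Is Rebound", "Speed"]])
```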
Question 5
1
distance-only: comet link
angle-only: comet link
distance and angle: comet link
We used sklearn.model_selection.train_test_split to split the data into 80% train and 20% validation. For the XGBoost classifiers trained on [distance from net], [angle from net], and [distance and angle from net], we got validation accuracies of 90.39%, 90.40%, and 90.39%, respectively, the same as the validation accuracies of the corresponding Logistic Regression classifiers in question 3.
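The evaluation setup can be sketched as below. GradientBoostingClassifier stands in for XGBoost so the snippet needs only scikit-learn, and the features and labels are synthetic; only the 80/20 split mirrors the report:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic features (e.g. distance and angle) and ~10% positive labels,
# roughly matching the class balance of the real shot data.
rng = np.random.default_rng(5)
X = rng.normal(size=(3000, 2))
y = rng.random(3000) < 0.1

# 80/20 train/validation split, stratified to preserve the class balance.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
acc = clf.score(X_val, y_val)
print(f"validation accuracy: {acc:.3f}")
```

Note that with labels this imbalanced, accuracy alone mostly reflects the majority class, which is exactly why the ROC/AUC comparison below is more informative.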
Comparing ROC curves, for the features [distance from net], [angle from net], [distance and angle from net], Logistic Regression got AUCs of 0.67, 0.50, 0.67, respectively, and XGBoost got AUCs of 0.68, 0.62, 0.70, respectively. So, we can see that XGBoost turned out to be a better performing classifier than Logistic Regression for these cases.
The goal rate curves for all 3 XGBoost models trend upwards as the shot probability model percentile increases. This is an improvement over Logistic Regression, whose goal rate curve did not improve when trained on [angle from net]. This suggests that XGBoost was able to learn something from all 3 different subsets of features, since all 3 curves rise rather than staying at the same goal rate.
The cumulative proportion of goals for all 3 XGBoost curves starts off increasing slowly and then accelerates as the shot probability model percentile increases. This is consistent with the goal rate curves, since a well-performing model should predict higher shot probabilities where the actual goal rates are higher. Since all 3 XGBoost curves accelerate rather than grow at a constant pace, this is an improvement over the Logistic Regression curves, where the model trained on [angle from net] increased at only a constant pace.
The calibration curves for XGBoost trained on [distance from net], [angle from net], [distance and angle from net] were all very close to the perfectly calibrated line where y=x until around when mean predicted probability was 0.2. This means the model was calibrated well up to 0.2 mean predicted probability. All 3 XGBoost curves also appear to be better calibrated than the Logistic Regression’s 3 curves.
2
We used grid search to tune the hyperparameters of the XGBoost model trained on all the features of the data, searching over the number of estimators, max depth, learning rate, and booster type. The best validation accuracy we got was 91.27%, with booster gbtree, learning rate 0.05, max depth 10, and 100 estimators. The ROC curve shows further improvement over the XGBoost models trained only on [distance from net], [angle from net], or [distance and angle from net]: XGBoost trained on all features achieved an AUC of 0.77, beating the previous best XGBoost AUC of 0.70. The goal rate curve also becomes even steeper as the shot probability model percentile increases, especially approaching 0.8, and the cumulative proportion of goals curve shows the same behavior. The calibration curve appears well calibrated up to around 0.4 mean predicted probability, an improvement over the previous XGBoost models, which were only well calibrated up to around 0.2. Judging from these results, this XGBoost trained on all features with tuned hyperparameters is an overall improvement over the XGBoost baseline model.
comet link
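The grid search described above can be sketched as follows. GradientBoostingClassifier again stands in for xgboost.XGBClassifier so the example only needs scikit-learn (it has no booster parameter, so the grid covers only learning rate, depth, and number of estimators), and the data is synthetic:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic problem with a learnable signal in the first feature.
rng = np.random.default_rng(6)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=600)) > 0.8

# Grid mirroring the hyperparameters we searched over.
grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 10],
    "n_estimators": [50, 100],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_, f"best CV accuracy: {search.best_score_:.3f}")
```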
3
Technique 1: Removing features with low variance
- By setting a variance threshold, we can compute the variance of each feature and remove the ones that do not meet the threshold.
- comet link
Technique 2: Univariate Selection
- Select the features that have the strongest relationship with the output variable based on a statistical test. We used the chi-squared test to compute the chi-squared statistic between each feature and the class, then selected the k features with the highest scores.
- comet link
Technique 3: Recursive Feature Elimination
- Features are recursively removed, based on their importance scores, until only the specified number of features to keep remains. Importance scores reflect how much each feature contributes to predicting the target class.
- comet link
Technique 4: Tree-based Feature Selection
- Use tree-based estimators (in our case the Extra-Trees Classifier, an extremely randomized tree classifier) to compute impurity-based feature importances, then sort the features by importance and keep only the most important ones.
- comet link
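The four techniques above map directly onto scikit-learn selectors. A minimal sketch on a synthetic dataset (chi-squared requires non-negative features, so the data is shifted for that selector):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, chi2, RFE, SelectFromModel)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# 1. Remove features whose variance is below a threshold.
vt = VarianceThreshold(threshold=0.5).fit(X)

# 2. Univariate selection: top k features by chi-squared statistic.
kb = SelectKBest(chi2, k=5).fit(X - X.min(axis=0), y)

# 3. Recursive feature elimination with a linear model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# 4. Tree-based importances via an extremely randomized trees classifier.
tree = SelectFromModel(ExtraTreesClassifier(random_state=0)).fit(X, y)

for name, sel in [("variance", vt), ("chi2", kb), ("rfe", rfe), ("tree", tree)]:
    print(name, sel.get_support().sum(), "features kept")
```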
The optimal set of features came from Univariate Selection, which reduced the original 18 features down to 14: ['eventIdx', 'game_id', 'Game Seconds', 'Game Period', 'Y-Coordinate', 'Shot Distance', 'Shot Type', 'Was Net Empty', 'Last Event Type', 'Last Y-Coordinate', 'Time from Last Event (seconds)', 'Distance from Last Event', 'Is Rebound', 'Speed']. When tested with a Logistic Regression model, these features improved the validation accuracy from 90.39% to 91.23%.
Question 6
1 Figures and Discussion
Approach 1: Different model type: Decision Tree Classifier
A supervised machine learning classification model in which the data is split at each level on some feature until each sample is assigned a class. Using the scikit-learn DecisionTreeClassifier with default parameters, we got a validation accuracy of 84.39% and an AUC of 0.57.
ROC
Goal
Cumulative
Calibration
Approach 2: Hyperparameter Tuning: Decision Tree Classifier with Randomized Search on Hyperparameters and Regularization
We tuned the hyperparameters of the decision tree model using randomized search, searching over the splitter, max_depth, min_samples_split, min_samples_leaf, max_features, and max_leaf_nodes parameters, and got a validation accuracy of 90.82%. The optimal parameter settings were: 'splitter': 'best', 'min_samples_split': 0.9, 'min_samples_leaf': 0.1, 'max_leaf_nodes': 2, 'max_features': 4, 'max_depth': 48.
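A sketch of that randomized search over the same decision-tree hyperparameters; the dataset here is synthetic:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(800, 5))
y = X[:, 0] + rng.normal(size=800) > 1.0   # learnable signal in feature 0

# Distributions over the parameters named above; fractional values for
# min_samples_* are interpreted by scikit-learn as fractions of the data.
param_dist = {
    "splitter": ["best", "random"],
    "max_depth": list(range(2, 50)),
    "min_samples_split": [0.1, 0.5, 0.9],
    "min_samples_leaf": [0.1, 0.2],
    "max_features": [1, 2, 3, 4, 5],
    "max_leaf_nodes": [2, 5, 10],
}
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                            param_dist, n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, f"best CV accuracy: {search.best_score_:.3f}")
```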
ROC
Goal
Cumulative
Calibration
Approach 3: More advanced feature selection strategy: Decision Tree Classifier with PCA feature reduction
We used PCA to reduce the features to 3 components, after which we got a validation accuracy of 82.86% with the Decision Tree model.
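A sketch of the PCA-then-decision-tree approach as a scikit-learn pipeline: standardize, project down to 3 principal components, then fit the tree. Data is synthetic:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 10))
y = X[:, 0] - X[:, 1] > 0

# Scaling before PCA keeps any one feature from dominating the components.
model = make_pipeline(StandardScaler(), PCA(n_components=3),
                      DecisionTreeClassifier(max_depth=5, random_state=0))
model.fit(X, y)
print(f"train accuracy: {model.score(X, y):.3f}")
```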
ROC
Goal
Cumulative
Calibration
Approach 4: Different model type: Multilayer Perceptron Classifier
The Multilayer Perceptron classifier is a type of feedforward artificial neural network. We got a validation accuracy of 90.82% and AUC of 0.5.
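A minimal sketch of the MLP approach; the hidden-layer sizes here are illustrative, not the exact configuration we trained, and the imbalanced labels are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# ~10% positive labels, mimicking the goal/no-goal imbalance.
rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 4))
y = rng.random(1000) < 0.1

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=200, random_state=0)
clf.fit(X, y)
print(f"accuracy: {clf.score(X, y):.3f}")
```

With features this uninformative the MLP mostly falls back to predicting the majority class, which matches the AUC of 0.5 reported above.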
ROC
Goal
Cumulative
Calibration
Summary
Both approaches 2 and 4 achieved a validation accuracy of 90.82% and an AUC of 0.5; these are our best 'final' models. However, we would pick the MLP over the Decision Tree Classifier with randomized hyperparameter search, because it achieves the same result without requiring additional hyperparameter tuning.
2 Models links
Approach 1
Approach 2
Approach 3
Approach 4
Question 7
1 - Regular Seasons
For the Logistic Regression models trained on [distance from net], [angle from net], and [distance and angle from net], testing on the untouched 2019/20 regular season dataset yielded AUCs of 0.68, 0.51, and 0.68, respectively, each 0.01 better than the corresponding validation AUCs of 0.67, 0.50, and 0.67. The best XGBoost model saved in part 5 achieved an AUC of only 0.54 on this test set, a lot worse than the AUC it achieved on the validation set earlier (0.77). The best overall model from part 6 (MLP) achieved an AUC of 0.50 on this test set, the same as its validation AUC (0.50); an unchanged result is normal to see, but 0.50 is a terrible result.
The goal rates for the three logistic regression curves on the test set are more or less the same as on the validation set. The best XGBoost model does not seem to have learned anything, judging from its goal rate curve, which shows no clear upward trend; this is similar to the best XGBoost curve on the validation set. The MLP goal rate curve was almost always at 0, which indicates the model did not learn anything about the data and cannot make good class-probability predictions.
The logistic regression models tested on [distance from net] and on [distance and angle from net] appear to have the best cumulative proportion of goals curves: they are the most curved, which indicates they learned something from the features of the data. The logistic regression model tested on [angle from net] and the best XGBoost model both produced curves very similar to y=x, indicating they did not make good class-probability predictions. The MLP was even worse, as its cumulative proportion of goals curve on the test set was almost always at 0.0.
ROC
Goal
Cumulative
Calibration
2 - Playoffs
For the Logistic Regression models trained on [distance from net], [angle from net], and [distance and angle from net], testing on the untouched 2019/20 playoff dataset yielded AUCs of 0.68, 0.51, and 0.68, respectively, each 0.01 better than the corresponding validation AUCs of 0.67, 0.50, and 0.67. The best XGBoost model saved in part 5 achieved an AUC of only 0.54 on this test set, a lot worse than the AUC it achieved on the validation set earlier (0.77). The best overall model from part 6 (MLP) achieved an AUC of 0.50 on this test set, the same as its validation AUC (0.50); an unchanged result is normal to see, but 0.50 is a terrible result.
The goal rates for the three logistic regression curves on the test set are more or less the same as on the validation set. The best XGBoost model does not seem to have learned anything, judging from its goal rate curve, which shows no clear upward trend; this is similar to the best XGBoost curve on the validation set. The MLP goal rate curve was almost always at 0, which indicates the model did not learn anything about the data and cannot make good class-probability predictions.
The logistic regression models tested on [distance from net] and on [distance and angle from net] appear to have the best cumulative proportion of goals curves: they are the most curved, which indicates they learned something from the features of the data. The logistic regression model tested on [angle from net] and the best XGBoost model both produced curves very similar to y=x, indicating they did not make good class-probability predictions. The MLP was even worse, as its cumulative proportion of goals curve on the test set was almost always at 0.0.
ROC
Goal
Cumulative
Calibration